Classifying and Segmenting Classical and Modern Standard Arabic using Minimum Cross-Entropy

Authors

  • Ibrahim S Alkhazi
  • William J. Teahan
Abstract

Text classification is the process of assigning a text or document to one or more predefined classes or categories that reflect its contents. Despite the rapid growth of Arabic text on the Web, studies that address the classification and segmentation of Arabic remain limited compared with those for other languages, and most of them implement word-based and feature extraction algorithms. This paper adopts a character-based PPM (Prediction by Partial Matching) compression scheme to classify and segment Classical Arabic (CA) and Modern Standard Arabic (MSA) texts. An initial experiment that applied the PPM classification method, which uses the concept of minimum cross-entropy, to samples of text achieved an accuracy of 95.5%, an average precision of 0.958, an average recall of 0.955 and an average F-measure of 0.954. PPM-based classification experiments on standard Arabic corpora showed that they contained different types of text (CA or MSA), or a mixture of both. Further experiments with the same corpora showed that the PPM-based segmentation method gives a more accurate picture of the contents of the corpora. Tag-based compression experiments (using tags produced by Arabic part-of-speech taggers) also showed that the quality of the tagging (as measured by compression quality) is significantly affected when tagging either CA or MSA text. The conclusion is that NLP applications (such as taggers) should treat these texts separately, either using different training data for each or processing them differently.

Keywords: text classification; Arabic language; Classical Arabic; Modern Standard Arabic
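
As a rough illustration of classification by minimum cross-entropy, the sketch below trains one character-level model per class and assigns a text to the class whose model encodes it in the fewest bits per character. It is not the authors' implementation: a fixed-order character model with add-one smoothing stands in for PPM's context blending and escape mechanism, and the names CharNgramModel and classify, the class labels, and the toy training strings are all hypothetical.

```python
import math
from collections import defaultdict


class CharNgramModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: this is a simplified stand-in for PPM, not PPM itself.
    """

    def __init__(self, order=2):
        self.order = order
        # counts[context][next_char] = times next_char followed context
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = set()

    def train(self, text):
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.counts[context][padded[i]] += 1
            self.alphabet.add(padded[i])

    def cross_entropy(self, text):
        """Average number of bits per character to encode text with this model."""
        padded = " " * self.order + text
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.counts.get(context, {})
            total = sum(seen.values())
            prob = (seen.get(padded[i], 0) + 1) / (total + vocab)
            bits -= math.log2(prob)
        return bits / max(len(text), 1)


def classify(text, models):
    """Return the label whose model gives the minimum cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy usage with hypothetical labels and training strings.
models = {"class_1": CharNgramModel(), "class_2": CharNgramModel()}
models["class_1"].train("abababab abab")
models["class_2"].train("xyzxyz xyzxyz")
print(classify("abab", models))
```

In practice each model would be trained on a sizeable sample of CA or MSA text; the decision rule stays the same regardless of which compression model supplies the per-character code lengths.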

Similar resources

Exploiting Out-of-Domain Data Sources for Dialectal Arabic Statistical Machine Translation

Statistical machine translation for dialectal Arabic is characterized by a lack of data, since data acquisition involves the transcription and translation of spoken language. In this study we develop techniques for extracting parallel data for one particular dialect of Arabic (Iraqi Arabic) from out-of-domain corpora in different dialects of Arabic or in Modern Standard Arabic. We compare two dif...

A New Method for Extracting Named Entities in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...

Predicting Phrase Breaks in Classical and Modern Standard Arabic Text

We train and test two probabilistic taggers for Arabic phrase break prediction on a purpose-built, "gold standard", boundary-annotated and PoS-tagged Qur'an corpus of 77430 words and 8230 sentences. In a related LREC paper (Brierley et al., 2012), we describe how the dataset was built. Here we report on comparative experiments with off-the-shelf N-gram and HMM taggers and coarse-grained feature sets for synta...

Weighted Entropy Cortical Algorithms for Modern Standard Arabic Speech Recognition

Cortical algorithms (CA), inspired by and modeled after the human cortex, have shown superior accuracy in a few machine learning applications. However, CA have not been extensively implemented for speech recognition applications, in particular for the Arabic language. Motivated to apply CA to Arabic speech recognition, we present in this paper an improved CA that is efficiently trained using an entrop...

Mani’s Living Gospel: A New Approach to the Arabic and Classical New Persian Testimonia

To reconstruct the contents of Mani's most famous work, the Living Gospel (originally written in Syriac), we have to use the Arabic and Classical New Persian texts that contain accounts of, and even indirect quotations from, this book. One of the most remarkable points in these accounts is that they clearly show that an important part of the Living Gospel contains the Manicha...

Publication date: 2017